ABSTRACT 

Motivation: Imaging genetics is an emerging field that studies the 
influence of genetic variation on brain structure and function. The 
major task is to examine the association between genetic markers 
such as single-nucleotide polymorphisms (SNPs) and quantitative 
traits (QTs) extracted from neuroimaging data. The complexity of 
these datasets has presented critical bioinformatics challenges that 
require new enabling tools. Sparse canonical correlation analysis 
(SCCA) is a bi-multivariate technique used in imaging genetics to iden- 
tify complex multi-SNP-multi-QT associations. However, most of the 
existing SCCA algorithms are designed using the soft thresholding 
method, which assumes that the input features are independent 
from one another. This assumption clearly does not hold for the ima- 
ging genetic data. In this article, we propose a new knowledge-guided 
SCCA algorithm (KG-SCCA) to overcome this limitation as well as 
improve learning results by incorporating valuable prior knowledge. 
Results: The proposed KG-SCCA method is able to model two types 
of prior knowledge: one as a group structure (e.g. linkage disequilib- 
rium blocks among SNPs) and the other as a network structure (e.g. 
gene co-expression network among brain regions). The new model 
incorporates these prior structures by introducing new regularization 
terms to encourage weight similarity between grouped or connected 
features. A new algorithm is designed to solve the KG-SCCA model 
without imposing the independence constraint on the input features. 
We demonstrate the effectiveness of our algorithm with both synthetic 
and real data. For real data, using an Alzheimer's disease (AD) cohort, 
we examine the imaging genetic associations between all SNPs in the 
APOE gene (i.e. top AD gene) and amyloid deposition measures 
among cortical regions (i.e. a major AD hallmark). In comparison 
with a widely used SCCA implementation, our KG-SCCA algorithm 
produces not only improved cross-validation performances but also 
biologically meaningful results. 
Availability: Software is freely available on request. 
Contact: shenli@iu.edu 



1 INTRODUCTION 

Brain imaging genetics is an emerging field that studies the 
influence of genetic variation on brain structure and function. 
Its major task is to examine the association between genetic 
markers such as single-nucleotide polymorphisms (SNPs) and 



*To whom correspondence should be addressed. 



quantitative traits (QTs) extracted from multimodal neuroima- 
ging data (e.g. anatomical, functional and molecular imaging 
scans). Given the well-known importance of gene and imaging 
phenotype in brain function, bridging these two factors and 
exploring their connections would lead to a better mechanistic 
understanding of normal or disordered brain functions. The 
complexity of these data, however, has presented critical
bioinformatics challenges requiring new enabling tools. Early studies
in imaging genetics typically focused on pairwise univariate
analysis (Shen et al., 2010). Many recent studies turned to regression
analysis for exploring the joint effect of multiple SNPs on
single or few QTs (Hibar et al., 2011) and bi-multivariate analyses
for revealing complex multi-SNP-multi-QT associations
(Chi et al., 2013; Lin et al., 2014; Vounou et al., 2010; Wan et al.,
2011).

Canonical correlation analysis (CCA), a bi-multivariate
method, has been applied to imaging genetics applications. It
aims to find the best linear transformation for imaging and genetics
features so that the highest correlation between imaging and
genetic components can be achieved. Based on the assumption
that a real imaging genetic signal typically involves a small
number of SNPs and QTs, sparse canonical correlation analysis
(SCCA) has also been applied in several imaging genetic studies
by imposing the Lasso regularization term to yield sparse results
(Chi et al., 2013; Lin et al., 2014; Wan et al., 2011). However,
most existing SCCA algorithms are designed using the soft
thresholding technique, which assumes that the input features
are independent from one another (Tibshirani, 1996). This
assumption clearly does not hold for imaging genetic data [e.g.
the existence of the structural and functional networks in the
brain and the linkage disequilibrium (LD) blocks in the genome].
Ignoring the covariance structure in the data will inevitably
limit the capability of yielding optimal results.

In this article, we propose a new knowledge-guided SCCA 
algorithm (KG-SCCA) to overcome this limitation and
to improve learning results by incorporating valuable
prior knowledge. The proposed KG-SCCA method is able to
model two types of prior knowledge: one as a group structure 
(e.g. LD blocks among SNPs) and the other as a network struc- 
ture (e.g. gene co-expression network among brain regions). The 
new model incorporates these prior structures by introducing 
new regularization terms to encourage similarity between 
grouped or connected features. A new algorithm is designed to 
solve the KG-SCCA model without imposing the independence 



© The Author 2014. Published by Oxford University Press. 

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/), which permits 
non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com 



Transcriptome-Guided Amyloid Imaging Genetics via KG-SCCA 



constraint on the input features. We demonstrate the effective- 
ness of our algorithm with both synthetic and real data. For real 
data, using an Alzheimer's disease (AD) cohort, we examine the 
imaging genetic associations between all SNPs in the APOE gene 
(i.e. top AD gene) and amyloid deposition measures among cortical
regions (i.e. a major AD hallmark). In comparison with a
widely used SCCA implementation in the PMA software package
(http://cran.r-project.org/web/packages/PMA/) (Witten et al.,
2009), our KG-SCCA algorithm produces improved cross-validation
performances as well as biologically meaningful results.

2 MATERIALS AND DATA SOURCES 

To demonstrate the proposed KG-SCCA algorithm, we apply it
to an amyloid imaging genetic analysis in the study of AD.
Deposition of amyloid-β in the cerebral cortex is a major hallmark
in AD pathogenesis. Our prior studies (Ramanan et al.,
2014; Swaminathan et al., 2012) performed univariate genetic
association analyses of amyloid measures in a few candidate cortical
regions of interest (ROIs), and identified several promising
hits including rs429358 in APOE, rs509208 in BCHE and
rs7551288 in DHCR24. In this work, using the proposed KG-SCCA
algorithm, we perform a bi-multivariate analysis to examine
the association between all the available SNPs (58 in total) in
the APOE gene (i.e. the top genetic risk factor for late onset AD)
and 78 ROIs across the entire cortex. We use two types of prior
knowledge in this analysis: (i) a group structure is imposed on the
SNP data using the LD block information (Fig. 4), and (ii) a
network structure is imposed on the amyloid imaging data by
computing an amyloid pathway-based gene co-expression network
in the brain using the Allen Human Brain Atlas (AHBA;
Zeng et al., 2012). Below, we first describe our amyloid imaging
and genotyping data, and then discuss our method for creating
the amyloid pathway-based gene co-expression network in the
brain.

2.1 Imaging and genotyping data 

The proposed algorithm, KG-SCCA, was empirically evaluated
using the amyloid imaging and genotyping data obtained from
the Alzheimer's Disease Neuroimaging Initiative (ADNI) database
(adni.loni.usc.edu). One goal of ADNI has been to test
whether serial magnetic resonance imaging (MRI), positron
emission tomography (PET), other biological markers and clinical
and neuropsychological assessment can be combined to
measure the progression of mild cognitive impairment (MCI)
and early AD. For up-to-date information, see www.adni-info.org.
Preprocessed [18F]Florbetapir PET scans (i.e. amyloid imaging
data) were downloaded from LONI (adni.loni.usc.edu).
Before downloading, images were averaged, aligned to a standard
space, resampled to a standard image and voxel size,
smoothed to a uniform resolution and normalized to a cerebellar
gray matter reference region, resulting in standardized uptake
value ratio images as previously described (Jagust et al., 2010).
After downloading, the images were aligned to each participant's
same-visit MRI scan and normalized to the Montreal
Neurological Institute (MNI) space as 2 × 2 × 2 mm voxels
using parameters from the MRI segmentation. ROI-level amyloid
measurements were further extracted based on the MarsBaR
AAL atlas. Genotype data of both ADNI-1 and ADNI-2/GO
phases were also obtained from LONI (adni.loni.usc.edu). All
the APOE SNPs were extracted based on the quality-controlled
and imputed data combining the two phases together. Only SNPs
available in Illumina 610Quad and/or OmniExpress arrays were
included in the analysis. As a result, we had 58 SNPs located
within 10 LD blocks (Fig. 4) computed using HaploView
(Barrett, 2009). A total of 568 non-Hispanic Caucasian participants
with both complete amyloid measurements and APOE
SNPs were studied, including 28 AD, 343 MCI and 196 healthy
control (HC) subjects (Table 1). Using the regression weights
derived from the HC participants, amyloid and SNP measures
were preadjusted to remove the effects of baseline age,
gender, education and handedness.

Table 1. Participant characteristics

Subjects                 AD              MCI             HC
Number                   28              343             196
Gender (M/F)             18/10           203/140         102/94
Handedness (R/L)         23/5            309/34          178/18
Age (mean ± std)         75.23 ± 10.66   71.92 ± 7.47    74.77 ± 5.39
Education (mean ± std)   15.61 ± 2.74    15.99 ± 2.75    16.46 ± 2.65
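The pre-adjustment step described above can be sketched as follows. This is only a minimal illustration: it assumes an ordinary least-squares fit on the HC participants and assumes that the fitted covariate effects (but not the intercept) are subtracted from every participant. The function name `adjust_with_hc_weights` and the toy data are hypothetical and are not taken from the study.

```python
import numpy as np

def adjust_with_hc_weights(measures, covariates, is_hc):
    """Remove covariate effects using regression weights fitted on HC only.

    measures:   (n,) or (n, k) array of amyloid or SNP measures
    covariates: (n, c) array, e.g. age, gender, education, handedness
    is_hc:      (n,) boolean mask marking healthy-control participants
    """
    measures = np.asarray(measures, dtype=float)
    covariates = np.asarray(covariates, dtype=float)
    # Design matrix with an intercept column followed by the covariates.
    design = np.column_stack([np.ones(len(covariates)), covariates])
    # Regression weights are estimated from the HC participants only.
    coef, *_ = np.linalg.lstsq(design[is_hc], measures[is_hc], rcond=None)
    # Subtract the covariate effects (assumption: the intercept is kept).
    return measures - design[:, 1:] @ coef[1:]

# Toy example (synthetic numbers, for illustration only).
rng = np.random.default_rng(0)
age = rng.uniform(60, 90, size=40)
education = rng.uniform(8, 20, size=40)
covariates = np.column_stack([age, education])
is_hc = np.arange(40) < 20
measure = 1.2 + 0.01 * age - 0.02 * education
adjusted = adjust_with_hc_weights(measure, covariates, is_hc)
```

In this noise-free toy case the adjusted measure is constant across participants, because all of its variation came from the covariates.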

2.2 Amyloid pathway-based gene co-expression network 
in the brain 

Because we examine cortical amyloid deposition in relation to
genetic variation, we hypothesize that amyloid pathway-based
gene co-expression profiles among cortical ROIs may provide
valuable information in the search for APOE-related amyloid
distribution patterns in the cortex. Thus, we used the brain
transcriptome data from the AHBA (Zeng et al., 2012), coupled
with 15 candidate genes from amyloid pathways studied in
(Swaminathan et al., 2012), to create such a brain network.

Gene expression profiles across the whole human brain were 
downloaded from Allen Institute for Brain Science. One of their 
goals is to advance the research and knowledge about neurobio- 
logical conditions, with extensive mapping of whole-genome 
gene expression throughout the brain. Among various organ- 
isms, AHBA is one of the projects seeking to combine the gen- 
omics with the neuroanatomy to better understand the 
connection between genes and brain functioning. Gene expression
profiles in eight healthy human brains have been released,
including two full brains and six right hemispheres. Details can
be found at www.brain-map.org.

Brain-wide expression data of all 15 amyloid-related candidate
genes, reported in (Swaminathan et al., 2012), were extracted
from AHBA to construct the brain network. Because an early
report indicated that individuals share as much as 95% gene
expression profile (Zeng et al., 2012), in this study, we only
included two full brains (H0351-2201 and H0351-2002) to construct
the co-expression network. First, all the brain samples
(~900) in AHBA were mapped to the MarsBaR AAL atlas,
which included 116 brain ROIs. According to Ramanan et al.
(2014), cortical ROIs are typically believed to hold the amyloid
signals, whereas other ROIs hold similar amyloid measures across
individuals. Thus, 39 pairs of bilateral cortical ROIs (78 in total),
from frontal lobe, cingulate, parietal lobe, temporal lobe, occipital
lobe, insula and sensory-motor cortex, were included in our analysis.
Correlation among the ~900 brain locations was first calculated
based on the gene expression profiles of the 15 amyloid candidate
genes. Owing to the many-to-one mapping from the brain locations
to AAL ROIs, each pair of ROIs is linked by more than one connection,
each represented by the correlation between two brain locations.
Therefore, we calculated ROI-level correlations of the two individuals
in five ways: minimum, maximum, mean, standard deviation and
median. In addition, the ROI correlation structure based on the
combination of both individuals was also generated in the same
way for comparison (Fig. 1). For all five statistics, the
pattern remains highly consistent across individuals and their
combination. For simplicity, in the subsequent analysis, we
adopt the brain connectivity matrix generated from the combined
sample using the median statistic (i.e. the panel in the lower
right corner of Fig. 1). Figure 2 shows a network visualization of
this matrix, where edges correspond to matrix entries with values
> 0.5 or < -0.5.

Fig. 1. Amyloid pathway-based gene co-expression networks among 78
AAL cortical ROIs constructed from AHBA using different statistics (see
different rows) for two individuals and their combination

Fig. 2. Network visualization by thresholding the connectivity matrix
shown in the lower right corner of Figure 1, where edges correspond to
matrix entries with values > 0.5 or < -0.5. The circle is symmetric (left
measures on left and right measures on right); from top to bottom are
frontal lobe, cingulate, parietal lobe, temporal lobe, occipital lobe, insula
and sensory-motor cortex
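The ROI-level aggregation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the sample-level correlation is taken to be the Pearson correlation between the expression profiles of two brain locations, the diagonal of the ROI network is set to 1, and the function name `roi_coexpression` and the toy data are hypothetical.

```python
import numpy as np

def roi_coexpression(expr, roi_labels, stat=np.median):
    """Aggregate sample-level co-expression into an ROI-level network.

    expr:       (n_samples, n_genes) expression of the candidate genes
    roi_labels: (n_samples,) ROI assignment of each brain location
    stat:       summary statistic over all sample pairs of two ROIs
    """
    roi_labels = np.asarray(roi_labels)
    # Pearson correlation between brain locations across the genes.
    R = np.corrcoef(expr)
    rois = np.unique(roi_labels)
    net = np.eye(len(rois))  # assumption: self-connections set to 1
    for a in range(len(rois)):
        idx_a = np.flatnonzero(roi_labels == rois[a])
        for b in range(a + 1, len(rois)):
            idx_b = np.flatnonzero(roi_labels == rois[b])
            # Summarize all location-to-location correlations of this ROI pair.
            net[a, b] = net[b, a] = stat(R[np.ix_(idx_a, idx_b)])
    return rois, net

# Toy example: 8 brain locations, 15 genes, 3 ROIs (synthetic values).
rng = np.random.default_rng(0)
expr = rng.standard_normal((8, 15))
roi_labels = [0, 0, 0, 1, 1, 2, 2, 2]
rois, net = roi_coexpression(expr, roi_labels, stat=np.median)

# Edges as drawn in the network visualization: |entry| > 0.5, off-diagonal.
edges = (np.abs(net) > 0.5) & ~np.eye(len(rois), dtype=bool)
```

Passing `np.min`, `np.max`, `np.mean` or `np.std` as `stat` gives the other four summaries mentioned in the text.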



3 METHODS 

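Now we present our KG-SCCA algorithm. We denote vectors as boldface
lowercase letters and matrices as boldface uppercase ones. For a
given matrix $\mathbf{M} = (m_{ij})$, we denote its $i$-th row and $j$-th
column as $\mathbf{m}^i$ and $\mathbf{m}_j$, respectively. Let
$\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\} \subseteq \mathbb{R}^p$
be the genotype data (SNP) and
$\mathbf{Y} = \{\mathbf{y}_1, \ldots, \mathbf{y}_n\} \subseteq \mathbb{R}^q$
be the imaging QT data, where $n$ is the number of participants, and $p$
and $q$ are the numbers of SNPs and QTs, respectively.

CCA seeks linear transformations of variables X and Y to achieve the
maximal correlation between Xu and Yv, which can be formulated as:

$$\max_{\mathbf{u},\mathbf{v}} \ \mathbf{u}^T\mathbf{X}^T\mathbf{Y}\mathbf{v} \quad \text{s.t.} \quad \mathbf{u}^T\mathbf{X}^T\mathbf{X}\mathbf{u} = 1, \ \mathbf{v}^T\mathbf{Y}^T\mathbf{Y}\mathbf{v} = 1 \tag{1}$$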

where u and v are canonical loadings or weights, reflecting the significance 
of each feature in the identified canonical correlation. 
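For reference, the unpenalized problem in Equation (1) has a closed-form solution through a singular value decomposition of the whitened cross-product matrix. The sketch below illustrates this standard approach; it is not part of the proposed method. The small ridge term, the function name `first_canonical_pair` and the toy data are assumptions added for numerical stability and illustration.

```python
import numpy as np

def first_canonical_pair(X, Y, ridge=1e-8):
    """Solve max u^T X^T Y v  s.t.  u^T X^T X u = v^T Y^T Y v = 1."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # The ridge keeps the Cholesky factorization well defined (assumption).
    Sxx = X.T @ X + ridge * np.eye(X.shape[1])
    Syy = Y.T @ Y + ridge * np.eye(Y.shape[1])
    Sxy = X.T @ Y
    Lx = np.linalg.cholesky(Sxx)  # Sxx = Lx Lx^T
    Ly = np.linalg.cholesky(Syy)  # Syy = Ly Ly^T
    # Whitened cross-product K = Lx^{-1} Sxy Ly^{-T}.
    K = np.linalg.solve(Lx, Sxy)
    K = np.linalg.solve(Ly, K.T).T
    U, s, Vt = np.linalg.svd(K, full_matrices=False)
    # Map the leading singular vectors back to the original coordinates.
    u = np.linalg.solve(Lx.T, U[:, 0])
    v = np.linalg.solve(Ly.T, Vt[0])
    return u, v, s[0]

# Toy data: one latent variable shared by a feature of X and a feature of Y.
rng = np.random.default_rng(0)
z = rng.standard_normal(300)
X = np.column_stack([z + 0.5 * rng.standard_normal(300),
                     rng.standard_normal(300),
                     rng.standard_normal(300)])
Y = np.column_stack([rng.standard_normal(300),
                     z + 0.5 * rng.standard_normal(300)])
u, v, rho = first_canonical_pair(X, Y)
```

The returned `rho` is the first canonical correlation, and the entries of `u` and `v` are the canonical loadings. Without a sparsity penalty, every feature generally receives a non-zero loading.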

Similar to many machine learning algorithms, overfitting could arise
in CCA when the features outnumber the participants. In addition, the
CCA outcome could spread non-trivial effects across all the features
rather than only a few significant ones, making the results difficult to
interpret. To address these issues, SCCA was proposed in (Witten et al.,
2009) by introducing penalty terms, $P_1(\mathbf{u}) \le c_1$ and
$P_2(\mathbf{v}) \le c_2$, to regularize the weights, as shown in
Equation (2).

$$\max_{\mathbf{u},\mathbf{v}} \ \mathbf{u}^T\mathbf{X}^T\mathbf{Y}\mathbf{v} \quad \text{s.t.} \quad \|\mathbf{X}\mathbf{u}\|_2^2 = 1, \ \|\mathbf{Y}\mathbf{v}\|_2^2 = 1, \ P_1(\mathbf{u}) \le c_1, \ P_2(\mathbf{v}) \le c_2 \tag{2}$$

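Here the objective function is bilinear in u and v: when u is fixed, it is
linear in v and vice versa. However, because of the L2 equality
constraints, the feasible set is not convex even with u or v fixed. This
can be addressed by relaxing the L2 equalities into the inequalities
$\|\mathbf{X}\mathbf{u}\|_2^2 \le 1$ and $\|\mathbf{Y}\mathbf{v}\|_2^2 \le 1$.
For ease of computation, Equation (2) is commonly rewritten in its
Lagrangian form, with the penalty weights absorbed into $P_1$ and $P_2$.

$$\max_{\mathbf{u},\mathbf{v}} \ \mathbf{u}^T\mathbf{X}^T\mathbf{Y}\mathbf{v} - \frac{\gamma_1}{2}\|\mathbf{X}\mathbf{u}\|_2^2 - \frac{\gamma_2}{2}\|\mathbf{Y}\mathbf{v}\|_2^2 - P_1(\mathbf{u}) - P_2(\mathbf{v}) \tag{3}$$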

Witten et al. (2009) and Witten and Tibshirani (2009) explored two
penalty forms, the $L_1$ penalty and the chain-structured fused Lasso
penalty. The $L_1$ penalty imposes sparsity on both u and v and assumes
that each canonical correlation involves only a few features from X and
Y. The fused Lasso penalty promotes the smoothness of weight vectors and
encourages neighboring features to be selected together. To incorporate
other structures, group- and network-guided penalties were introduced
(Chen and Liu, 2012; Chen et al., 2013). As mentioned earlier, most of
these methods were designed using the soft thresholding technique, which
was first proposed to solve the Lasso problem when the features were
independent from each other (Tibshirani, 1996). This condition does not
hold in imaging genetics data. Thus, direct application of those methods
to imaging genetics studies limits the capability of yielding optimal
solutions. Below, we first present our KG-SCCA model and then present an
effective KG-SCCA algorithm without using the soft thresholding strategy.
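The role of the independence assumption can be illustrated with a small sketch. When the columns of the design matrix are orthonormal, a single soft-thresholding step solves the Lasso problem exactly; when features are correlated it does not. The coordinate-descent solver `lasso_cd` below is a generic reference solver written for this illustration (it is not the authors' algorithm), and the data are synthetic and noise-free.

```python
import numpy as np

def soft_threshold(a, lam):
    """Element-wise soft thresholding: sign(a) * max(|a| - lam, 0)."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Coordinate descent for min_b 0.5 * ||y - X b||^2 + lam * ||b||_1."""
    b = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(X.shape[1]):
            # Residual with the contribution of feature j removed.
            partial_resid = y - X @ b + X[:, j] * b[j]
            b[j] = soft_threshold(X[:, j] @ partial_resid, lam) / col_sq[j]
    return b

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((50, 4)))  # orthonormal columns
b_true = np.array([2.0, 0.0, -1.0, 0.0])
lam = 0.5

# Independent (orthonormal) features: one soft-thresholding step is exact.
y = Q @ b_true
one_step_indep = soft_threshold(Q.T @ y, lam)
lasso_indep = lasso_cd(Q, y, lam)

# Correlated features: column 1 has correlation 0.9 with column 0.
mix = np.eye(4)
mix[:2, 1] = [0.9, np.sqrt(1 - 0.9 ** 2)]
Xc = Q @ mix
yc = Xc @ b_true
one_step_corr = soft_threshold(Xc.T @ yc, lam)
lasso_corr = lasso_cd(Xc, yc, lam)
```

In this toy case the one-step estimate on the correlated design keeps the redundant feature (index 1) with a large weight, whereas the Lasso solution sets it to zero. This is the kind of discrepancy that motivates avoiding the soft thresholding shortcut when features are correlated.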

The brain has been studied as a complex network, and the SNP data have
structures such as LD blocks. Given this prior knowledge, we propose the
following KG-SCCA model by introducing two penalty terms for the genetic
loadings u and the imaging loadings v, respectively.



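$$
\begin{aligned}
P_1(\mathbf{u}) = \|\mathbf{u}\|_G &= \beta_1 \sum_{k_1=1}^{K_1} \sqrt{\sum_{i \in \pi_{k_1}} u_i^2} + \theta_1 \|\mathbf{u}\|_1 \\
&= \beta_1 \sum_{k_1=1}^{K_1} \|\mathbf{u}^{k_1}\|_2 + \theta_1 \|\mathbf{u}\|_1, \\
P_2(\mathbf{v}) = \|\mathbf{v}\|_N &= \beta_2 \sum_{(i,j) \in E,\, i<j} \tau(w_{ij}) \, \|v_i - \mathrm{sign}(w_{ij}) v_j\|_2^2 + \theta_2 \|\mathbf{v}\|_1 \\
&= \beta_2 \|\mathbf{C}\mathbf{v}\|_2^2 + \theta_2 \|\mathbf{v}\|_1.
\end{aligned}
\tag{4}
$$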

In penalty $P_1(\mathbf{u})$, SNPs are partitioned into $K_1$ groups
$\Pi_1 = \{\pi_{k_1}\}_{k_1=1}^{K_1}$, where $m_{k_1}$ is the number of
SNPs in $\pi_{k_1}$. While the group term
$\sum_{k_1=1}^{K_1} \|\mathbf{u}^{k_1}\|_2$ helps select all the SNPs in
relevant LD blocks, the $L_1$ penalty manages to suppress the non-signals
within selected LD blocks. The $P_1(\mathbf{u})$ penalty is essentially
the group Lasso penalty applied
to the CCA framework.
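As a concrete reading of the group penalty, the short sketch below evaluates the group term plus the L1 term for a toy loading vector. The function name `group_sparse_penalty`, the toy LD blocks and the parameter values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def group_sparse_penalty(u, groups, beta1, theta1):
    """beta1 * sum_k ||u^k||_2 + theta1 * ||u||_1 for a partition `groups`."""
    u = np.asarray(u, dtype=float)
    # Sum of Euclidean norms of the loadings within each group (LD block).
    group_term = sum(np.linalg.norm(u[np.asarray(g)]) for g in groups)
    return beta1 * group_term + theta1 * np.abs(u).sum()

# Two toy LD blocks: only the first block carries non-zero loadings.
u = np.array([3.0, 4.0, 0.0, 0.0])
ld_blocks = [[0, 1], [2, 3]]
penalty = group_sparse_penalty(u, ld_blocks, beta1=1.0, theta1=0.5)
# Group term: ||(3, 4)||_2 + ||(0, 0)||_2 = 5; L1 term: 0.5 * 7 = 3.5.
```

A block whose loadings are all zero contributes nothing to the group term, which is why the penalty favors switching whole LD blocks on or off, while the L1 term can still zero out individual SNPs inside a selected block.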

Penalty $P_2(\mathbf{v})$ applies the network-guided constraint to
encourage the joint selection of 'connected' features (i.e. their
connectivity matrix entry having a high weight) and uses $L_1$ to impose
global sparsity. $E$ is the set of all possible imaging QT pairs and $|E|$
is the total number of QT pairs. $\mathbf{C} \in \mathbb{R}^{|E| \times q}$
is defined as follows. The rows of $\mathbf{C}$ are indexed by all pairs
$\{(i,j) \mid i \in \{1, \ldots, q\}, j \in \{1, \ldots, q\}, i<j\}$, with
$C_{(i,j),i} = w_{ij}$ and $C_{(i,j),j} = -\mathrm{sign}(w_{ij})\, w_{ij}$.
$\tau(w_{ij})$ provides the fusion effect that promotes similarity
between $v_i$ and $v_j$ of related features. In this article, we use
$\tau(w_{ij}) = w_{ij}^2$. With $\mathrm{sign}(w_{ij})$, positively related
features are pulled together, whereas negatively related features are
fused in opposite directions. Thus, strongly connected features with a
large fusion effect
tend to be jointly selected or jointly not selected.
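The relation between the pairwise fusion term and the matrix C can be checked numerically. The sketch below assumes that the entry of C for the second feature of each pair carries a negative sign, so that the squared norm of Cv equals the pairwise sum with squared connectivity weights. The function name `build_fusion_matrix` and the toy connectivity matrix are hypothetical.

```python
import numpy as np
from itertools import combinations

def build_fusion_matrix(W):
    """Build C with one row per feature pair (i, j), i < j.

    Row (i, j) holds w_ij in column i and -sign(w_ij) * w_ij in column j,
    so that (C v)_(i,j) = w_ij * (v_i - sign(w_ij) * v_j).
    """
    q = W.shape[0]
    pairs = list(combinations(range(q), 2))
    C = np.zeros((len(pairs), q))
    for row, (i, j) in enumerate(pairs):
        C[row, i] = W[i, j]
        C[row, j] = -np.sign(W[i, j]) * W[i, j]
    return C, pairs

# Toy connectivity: features 0 and 1 positively related, 0 and 2 negatively.
W = np.array([[0.0, 0.8, -0.6],
              [0.8, 0.0, 0.0],
              [-0.6, 0.0, 0.0]])
C, pairs = build_fusion_matrix(W)

v = np.array([0.5, -1.0, 2.0])
fused = np.sum((C @ v) ** 2)
direct = sum(W[i, j] ** 2 * (v[i] - np.sign(W[i, j]) * v[j]) ** 2
             for i, j in pairs)

# Loadings that agree with the network signs incur no fusion penalty.
v_aligned = np.array([1.0, 1.0, -1.0])
zero_penalty = np.sum((C @ v_aligned) ** 2)
```

Pairs with zero connectivity produce all-zero rows in C, so they do not contribute to the penalty.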

In this work, as mentioned earlier, we formed the group structure for 
the SNP data by partitioning them using LD blocks generated by 
HaploView (Barrett, 2009). We formed the network structure for the 
amyloid imaging data by constructing amyloid pathway-based gene co- 
expression network using AHBA. Because the model could be easily ex- 
tended to estimate multiple canonical variables, we only focus on creating 
the first pair of canonical variables in this article. 



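Algorithm 1 Knowledge-guided SCCA (KG-SCCA)
Require:
    $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$, $\mathbf{Y} = \{\mathbf{y}_1, \ldots, \mathbf{y}_n\}$, group and network structures
Ensure:
    Canonical vectors u and v.
 1: $t = 1$, initialize $\mathbf{u}_t \in \mathbb{R}^{p \times 1}$, $\mathbf{v}_t \in \mathbb{R}^{q \times 1}$;
 2: while not converged do
 3:     Calculate $\mathbf{B}_{1t} = \frac{1}{\gamma_1}\mathbf{Y}\mathbf{v}_t$;
 4:     Calculate the block diagonal matrix $\mathbf{D}_{1t}$ and $\mathbf{D}_{2t}$;
 5:     $\mathbf{u}_{t+1} = (\mathbf{X}^T\mathbf{X} + \frac{\beta_1}{\gamma_1}\mathbf{D}_{1t} + \frac{\theta_1}{\gamma_1}\mathbf{D}_{2t})^{-1}\mathbf{X}^T\mathbf{B}_{1t}$;
 6:     Scale $\mathbf{u}_{t+1}$ so that $\mathbf{u}_{t+1}^T\mathbf{X}^T\mathbf{X}\mathbf{u}_{t+1} = 1$;
 7:     Calculate $\mathbf{B}_{2t} = \frac{1}{\gamma_2}\mathbf{X}\mathbf{u}_{t+1}$;
 8:     Calculate the block diagonal matrix $\mathbf{D}_{4t}$;
 9:     $\mathbf{v}_{t+1} = (\mathbf{Y}^T\mathbf{Y} + \frac{\beta_2}{\gamma_2}\mathbf{D}_{3} + \frac{\theta_2}{\gamma_2}\mathbf{D}_{4t})^{-1}\mathbf{Y}^T\mathbf{B}_{2t}$;
10:     Scale $\mathbf{v}_{t+1}$ so that $\mathbf{v}_{t+1}^T\mathbf{Y}^T\mathbf{Y}\mathbf{v}_{t+1} = 1$;
11:     $t = t + 1$.
12: end while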



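We now present our algorithm to solve this model without using the soft
thresholding approach. By fixing v and u in turn, we obtain the two
convex problems shown in Equation (5).

$$
\begin{aligned}
&\max_{\mathbf{u}} \ \mathbf{u}^T\mathbf{X}^T\mathbf{Y}\mathbf{v} - \frac{\gamma_1}{2}\|\mathbf{X}\mathbf{u}\|_2^2 - \beta_1 \sum_{k_1=1}^{K_1} \|\mathbf{u}^{k_1}\|_2 - \theta_1 \|\mathbf{u}\|_1 \\
&\max_{\mathbf{v}} \ \mathbf{u}^T\mathbf{X}^T\mathbf{Y}\mathbf{v} - \frac{\gamma_2}{2}\|\mathbf{Y}\mathbf{v}\|_2^2 - \beta_2 \|\mathbf{C}\mathbf{v}\|_2^2 - \theta_2 \|\mathbf{v}\|_1
\end{aligned}
\tag{5}
$$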



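Let $\mathbf{B}_1 = \frac{1}{\gamma_1}\mathbf{Y}\mathbf{v}$ and
$\mathbf{B}_2 = \frac{1}{\gamma_2}\mathbf{X}\mathbf{u}$; the above problems
can then be reformulated as Equation (6):

$$
\begin{aligned}
&\min_{\mathbf{u}} \ \frac{1}{2}\|\mathbf{X}\mathbf{u} - \mathbf{B}_1\|_2^2 + \frac{\beta_1}{\gamma_1} \sum_{k_1=1}^{K_1} \|\mathbf{u}^{k_1}\|_2 + \frac{\theta_1}{\gamma_1} \|\mathbf{u}\|_1 \\
&\min_{\mathbf{v}} \ \frac{1}{2}\|\mathbf{Y}\mathbf{v} - \mathbf{B}_2\|_2^2 + \frac{\beta_2}{\gamma_2} \|\mathbf{C}\mathbf{v}\|_2^2 + \frac{\theta_2}{\gamma_2} \|\mathbf{v}\|_1
\end{aligned}
\tag{6}
$$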



Here, while u can be solved by the G-SMuRFS method proposed in
(Wang et al., 2012), optimization of v can be achieved by the
network-guided $L_{2,1}$ regression method proposed in (Yan et al., 2013).
In both solutions, a smooth approximation has been estimated for the group
$L_{2,1}$ and $L_1$ terms by including an extremely small value. The
solution for u and v in each iteration step is as follows:

$$
\begin{aligned}
\mathbf{u} &= \left(\mathbf{X}^T\mathbf{X} + \frac{\beta_1}{\gamma_1}\mathbf{D}_1 + \frac{\theta_1}{\gamma_1}\mathbf{D}_2\right)^{-1}\mathbf{X}^T\mathbf{B}_1, \\
\mathbf{v} &= \left(\mathbf{Y}^T\mathbf{Y} + \frac{\beta_2}{\gamma_2}\mathbf{D}_3 + \frac{\theta_2}{\gamma_2}\mathbf{D}_4\right)^{-1}\mathbf{Y}^T\mathbf{B}_2,
\end{aligned}
\tag{7}
$$

where $\mathbf{D}_1$ is a block diagonal matrix with the $k$-th diagonal
block as $\frac{1}{2\|\mathbf{u}^k\|_2}\mathbf{I}_k$; $\mathbf{I}_k$ is an
identity matrix of size $m_k$; $m_k$ is the total feature number in
group $k$; $\mathbf{D}_2$ is a diagonal matrix with the $i$-th diagonal
element as $\frac{1}{2|u_i|}$; $\mathbf{D}_3 = \mathbf{C}^T\mathbf{C}$ is a
matrix in which each row integrates all the neighboring relationships
(e.g. for the $i$-th row, it is the sum of all the rows in $\mathbf{C}$
whose $i$-th element is not zero); and $\mathbf{D}_4$ is a diagonal matrix
with the $i$-th diagonal element as $\frac{1}{2|v_i|}$. Algorithm 1
summarizes the KG-SCCA optimization procedure. Further details on how to
solve the two objectives in Equation (6) are available in (Wang et al.,
2012) and (Yan et al., 2013),
respectively.
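The alternating updates can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the update matrices follow the form of Equation (7), the smoothing constant `eps`, the constant initialization, the stopping rule and the function name `kg_scca_sketch` are all assumptions, and the toy data are synthetic.

```python
import numpy as np

def kg_scca_sketch(X, Y, groups, C, beta=(1.0, 1.0), theta=(1.0, 1.0),
                   gamma=(1.0, 1.0), n_iter=100, eps=1e-8, tol=1e-6):
    """Alternate the u- and v-updates written in the form of Equation (7).

    groups: list of index lists partitioning the columns of X
    C:      fusion matrix encoding the network among the columns of Y
    """
    p, q = X.shape[1], Y.shape[1]
    u, v = np.ones(p) / p, np.ones(q) / q  # assumed initialization
    XtX, YtY, D3 = X.T @ X, Y.T @ Y, C.T @ C
    for _ in range(n_iter):
        u_old, v_old = u.copy(), v.copy()
        # u-step with v fixed.
        B1 = Y @ v / gamma[0]
        d1 = np.zeros(p)
        for g in groups:
            # Same weight for every feature of a group (block of D1).
            d1[g] = 1.0 / (2.0 * np.sqrt(np.sum(u[g] ** 2) + eps))
        d2 = 1.0 / (2.0 * np.sqrt(u ** 2 + eps))  # smoothed 1 / (2|u_i|)
        A = XtX + np.diag((beta[0] * d1 + theta[0] * d2) / gamma[0])
        u = np.linalg.solve(A, X.T @ B1)
        u = u / np.sqrt(u @ XtX @ u)  # scale so that u^T X^T X u = 1
        # v-step with u fixed.
        B2 = X @ u / gamma[1]
        d4 = 1.0 / (2.0 * np.sqrt(v ** 2 + eps))  # smoothed 1 / (2|v_i|)
        A = YtY + (beta[1] / gamma[1]) * D3 + np.diag(theta[1] * d4 / gamma[1])
        v = np.linalg.solve(A, Y.T @ B2)
        v = v / np.sqrt(v @ YtY @ v)  # scale so that v^T Y^T Y v = 1
        if max(np.max(np.abs(u - u_old)), np.max(np.abs(v - v_old))) < tol:
            break
    return u, v

# Toy data: the first two columns of X and of Y share one latent signal.
rng = np.random.default_rng(0)
n = 200
z = rng.standard_normal(n)
X = rng.standard_normal((n, 6))
X[:, :2] = z[:, None] + 0.3 * rng.standard_normal((n, 2))
Y = rng.standard_normal((n, 4))
Y[:, :2] = z[:, None] + 0.3 * rng.standard_normal((n, 2))
X -= X.mean(axis=0)
Y -= Y.mean(axis=0)
groups = [[0, 1], [2, 3], [4, 5]]
C = np.array([[0.9, -0.9, 0.0, 0.0]])  # one edge linking QT 0 and QT 1
u, v = kg_scca_sketch(X, Y, groups, C)
cc = np.corrcoef(X @ u, Y @ v)[0, 1]
```

Each update solves a linear system instead of applying soft thresholding, so correlations among the columns of X and Y enter the solution directly through the Gram matrices.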

In Algorithm 1, six parameters $\gamma_1, \gamma_2, \beta_1, \beta_2,
\theta_1, \theta_2$ need to be tuned to control the global sparsity as
well as the structured group or network constraints. Chen and Liu (2012)
studied a similar problem using a different method, and found that their
results were insensitive to the $\gamma_1, \gamma_2$ settings. Following
their observation, we set $\gamma_1$ and $\gamma_2$ to 1 for simplicity.
Nested cross-validation can be used for parameter selection but will be
extremely time-consuming for the remaining four parameters. Thus, we
followed the strategy proposed in (Lin et al., 2014): parameters
$\beta_1, \beta_2$ controlling structural constraints were first tuned
without considering sparsity constraints. Then, based on the obtained
optimal $\beta_1, \beta_2$, another nested cross-validation was performed
to acquire the optimal $\theta_1, \theta_2$.



4 EXPERIMENTAL RESULTS AND DISCUSSIONS 

We performed comparative studies between the proposed
KG-SCCA algorithm and a widely used SCCA implementation
in the PMA package (http://cran.r-project.org/web/packages/
PMA/) (Witten et al., 2009). For PMA experiments, the SCCA
parameters were automatically tuned using a permutation
scheme provided in PMA. Below we report our empirical results
using both synthetic data and real imaging genetics data.

Fig. 3. Five-fold trained weights of u and v. Ground truth of u and v is shown in the two leftmost panels. KG-SCCA results (top row) and PMA results
(bottom row) are shown in the remaining panels, corresponding to true correlation coefficients (CCs) ranging from 0.6 to 1.0. For each panel pair, the
five estimated u values are shown on the left panel, and the five estimated v values are shown on the right panel

Table 2. Five-fold cross-validation performance on synthetic data: mean ± std is shown for estimated correlation coefficients and AUC of the test data
using the trained model

          Correlation coefficients (CC)          AUC
True CC   KG-SCCA       PMA           P          KG-SCCA:u     PMA:u         P          KG-SCCA:v    PMA:v
0.60      0.56 ± 0.12   0.31 ± 0.14   2.19E-03   0.83 ± 0.08   0.64 ± 0.02   3.36E-03   1.0 ± 0.00   1.0 ± 0.00
0.64      0.56 ± 0.1    0.51 ± 0.12   2.32E-02   0.96 ± 0.04   0.65 ± 0.01   2.20E-05   1.0 ± 0.00   1.0 ± 0.00
0.70      0.64 ± 0.1    0.53 ± 0.1    1.27E-05   0.99 ± 0.01   0.62 ± 0.     6.21E-08   1.0 ± 0.00   1.0 ± 0.00
0.77      0.7 ± 0.14    0.6 ± 0.14    6.62E-03   0.99 ± 0.01   0.62 ± 0.     9.67E-09   1.0 ± 0.00   1.0 ± 0.00
0.85      0.76 ± 0.08   0.65 ± 0.1    1.02E-04   0.98 ± 0.03   0.63 ± 0.01   4.57E-06   1.0 ± 0.00   1.0 ± 0.00
0.95      0.87 ± 0.04   0.67 ± 0.09   1.19E-03   1.00 ± 0.00   0.63 ± 0.01   1.39E-08   1.0 ± 0.00   1.0 ± 0.00
1.00      0.92 ± 0.04   0.71 ± 0.06   2.46E-04   1.00 ± 0.00   0.64 ± 0.01   4.02E-08   1.0 ± 0.00   1.0 ± 0.00

Note. P-values of paired t-tests between KG-SCCA and PMA results are also shown.

4.1 Results on simulation data 

Because it was not straightforward to manually construct a
dataset with a network structure, we simulated group structures
for both datasets and then converted them into network structures
for one dataset by connecting all the pairs within each
group. Synthetic data (n = 200, p = 200, q = 150) with a diagonal
block structure were generated with the following procedure: (i)
A random positive definite covariance matrix M with non-overlapping
group structure was created, where correlations range
from 0.6 to 1 within each group and are set to 0 between groups.
(ii) Dataset X with covariance structure M was calculated
through Cholesky decomposition. (iii) Steps (i) and (ii) were
repeated to generate another dataset Y. (iv) With assigned canonical
loadings of X, we calculated the first component Xu. (v) Given a
desired correlation between components, we calculated the second
component Yv. (vi) For simplicity, in this article, only one group in Y
was assigned to have signals. Therefore, based on the predefined
canonical loadings of Y and the component Yv, the resulting
group signals, with added white noise [signal-to-noise
ratio (SNR) = 0.5], replaced the data in the original dataset Y.
By repeating this procedure, we generated seven datasets with
correlation levels from 0.6 to 1. The canonical loadings and
group structure remained the same across all the datasets.
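Steps (i), (ii), (iv) and (v) of the simulation can be sketched as follows. The sketch makes simplifying assumptions that are not stated in the text: each group uses a single constant within-group correlation drawn from [0.6, 0.95) so that the covariance matrix is safely positive definite, and the second component is built as a mixture of the first component and orthogonalized noise. The function names and the small dimensions are hypothetical.

```python
import numpy as np

def block_covariance(group_sizes, rng, low=0.6, high=0.95):
    """Block-diagonal covariance: constant correlation within each group."""
    p = sum(group_sizes)
    M = np.zeros((p, p))
    start = 0
    for size in group_sizes:
        rho = rng.uniform(low, high)
        block = np.full((size, size), rho)
        np.fill_diagonal(block, 1.0)
        M[start:start + size, start:start + size] = block
        start += size
    return M

def sample_with_covariance(M, n, rng):
    """Draw n samples whose population covariance is M (via Cholesky)."""
    L = np.linalg.cholesky(M)
    return rng.standard_normal((n, M.shape[0])) @ L.T

def component_with_correlation(a, r, rng):
    """Return b whose sample correlation with a equals r."""
    a = a - a.mean()
    e = rng.standard_normal(a.shape[0])
    e -= e.mean()
    e -= (e @ a) / (a @ a) * a                  # make the noise orthogonal to a
    e *= np.linalg.norm(a) / np.linalg.norm(e)  # match the norm of a
    return r * a + np.sqrt(1.0 - r ** 2) * e

rng = np.random.default_rng(0)
M = block_covariance([5, 5, 10], rng)
X = sample_with_covariance(M, 200, rng)
u_true = np.zeros(20)
u_true[:5] = 1.0                                # signal in the first group only
a = X @ u_true                                  # first component Xu
b = component_with_correlation(a, 0.8, rng)     # second component, target CC 0.8
```

The remaining step (vi), in which the signal group of Y is rebuilt from the second component and white noise, depends on details that the text does not fully specify and is therefore left out of the sketch.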

KG-SCCA and PMA have been both tested on all seven data- 
sets. All the regularization parameters were optimally tuned 
using a grid search from 10~^ to 10^ through nested 5-fold 
cross-validation, as mentioned before. The true and estimated
canonical loadings for both X and Y are shown in Figure 3.
Owing to the difference in normalization and optimization procedures,
the weights yielded by KG-SCCA and PMA showed
different scales. Yet, the overall profile of the estimated u and
v values from KG-SCCA remained consistent with the ground truth
across the entire range of tested correlation strengths (from 0.6 to
1.0), whereas PMA identified only an incomplete portion of
the signals. Furthermore, we also examined
the correlation in the test set computed using the models learned
from the training data for both methods. The left part of Table 2
demonstrates that KG-SCCA outperformed PMA consistently
Table 3. Five-fold cross-validation results on real data: the models learned from the training data were used to estimate the correlation coefficients
between canonical components for both training and testing sets

                  Train                                           Test
Method            f1      f2      f3      f4      f5      Mean    f1      f2      f3      f4      f5      Mean
KG-SCCA   exp1    0.471   0.448   0.475   0.451   0.46    0.461   0.431   0.515   0.401   0.417   0.459   0.445
          exp2    0.476   0.453   0.454   0.476   0.461   0.464   0.402   0.505   0.503   0.401   0.458   0.454
          exp3    0.476   0.474   0.474   0.468   0.402   0.459   0.408   0.393   0.413   0.435   0.565   0.443
          exp4    0.468   0.466   0.459   0.46    0.466   0.464   0.441   0.409   0.47    0.476   0.445   0.448
          exp5    0.49    0.502   0.434   0.449   0.447   0.464   0.35    0.297   0.584   0.527   0.528   0.457
PMA       exp1    0.439   0.418   0.438   0.438   0.426   0.432   0.368   0.45    0.398   0.379   0.439   0.407
          exp2    0.444   0.416   0.425   0.436   0.432   0.431   0.354   0.463   0.449   0.399   0.416   0.416
          exp3    0.442   0.445   0.439   0.427   0.398   0.43    0.382   0.341   0.382   0.432   0.544   0.416
          exp4    0.434   0.44    0.425   0.427   0.431   0.432   0.414   0.363   0.445   0.438   0.415   0.415
          exp5    0.459   0.462   0.406   0.416   0.411   0.431   0.288   0.287   0.517   0.486   0.501   0.416
P-value                                                   3.08E-6                                         8.07E-5

Note. P-values of paired t-tests were obtained for comparing KG-SCCA and PMA results.



and significantly, and it could accurately reveal the embedded
true correlation even in the test data. The right part of Table 2
demonstrates the sensitivity and specificity performance using
the area under the ROC curve (AUC), where KG-SCCA also significantly
outperformed PMA in u regardless of whether the correlation was
weak or strong. Because v has a relatively simple structure,
both KG-SCCA and PMA can recover its signals without any
loss. From the above results, it is also observed that KG-SCCA
could identify the correlations and signal locations not only more
accurately but also more stably.
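One way to compute such an AUC is sketched below. The text does not specify how the AUC was obtained, so the sketch assumes that features are ranked by the absolute value of their estimated loadings and compared against the ground-truth support. The function name `auc_from_weights` and the toy loadings are hypothetical.

```python
import numpy as np

def auc_from_weights(w_est, support):
    """AUC for recovering the true support from estimated loadings.

    Equals the probability that a randomly chosen true-signal feature has a
    larger |loading| than a randomly chosen null feature (ties count 1/2).
    """
    score = np.abs(np.asarray(w_est, dtype=float))
    support = np.asarray(support, dtype=bool)
    pos, neg = score[support], score[~support]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

support = np.array([True, False, True, False])

# All true-signal features outrank all null features.
auc_perfect = auc_from_weights([0.9, 0.0, -0.5, 0.1], support)

# One true-signal feature (0.05) is outranked by a null feature (0.2).
auc_partial = auc_from_weights([0.05, 0.2, 0.5, 0.0], support)
```

With this definition, a method that recovers only part of the signal, or that spreads weight onto null features, obtains an AUC below 1.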

4.2 Results on real imaging genetic data 

Both KG-SCCA and PMA were applied to the real amyloid
imaging and APOE genetics data. Similar to the previous analysis,
5-fold nested cross-validation was applied to optimally tune the
parameters. Five experiments were performed with five different
partitions to eliminate the bias. For each single experiment, the
same partition was used for both KG-SCCA and PMA. Table 3
shows both the training and test performances of KG-SCCA and
PMA in all five folds of the five experiments. Both methods
demonstrated stable results across the five trials. KG-SCCA was
observed to outperform PMA in every single experiment on both
training and test performance. A paired t-test was performed to
compare the performance across the five experiments, and KG-SCCA
outperformed PMA significantly in both training
(P = 3.08E-6) and test cases (P = 8.07E-5). We also tested two
simplified KG-SCCA models: one with only the penalty term for
the LD structure and the other with only the penalty term for the
network structure. Interestingly, both performed similarly to the
original KG-SCCA, and significantly outperformed PMA.
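The paired comparison can be illustrated with the per-experiment mean test correlations listed in Table 3. The sketch below computes only the paired t statistic with the standard library. It assumes that the pairs are the five per-experiment means, which may differ from the exact pairing used for the reported P-values, and it does not compute a P-value, which would need the CDF of a t distribution with four degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev

# Mean test correlation per experiment (exp1 to exp5), taken from Table 3.
kg_test = [0.445, 0.454, 0.443, 0.448, 0.457]
pma_test = [0.407, 0.416, 0.416, 0.415, 0.416]

# Paired t statistic: mean difference divided by its standard error.
diffs = [kg - pma for kg, pma in zip(kg_test, pma_test)]
t_stat = mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))
degrees_of_freedom = len(diffs) - 1
```

Every paired difference is positive, which matches the statement that KG-SCCA outperformed PMA in every experiment.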

Figure 4 shows the canonical loadings trained from
5-fold cross-validation in one experiment, suggesting relevant
genetic (top panel) and imaging (bottom panel) markers.
Although LD block constraints were imposed on relevant SNP
markers, the L1 penalty managed to exclude irrelevant signals. Only
the APOE ε4 SNP (rs429358) was identified to be associated with
amyloid accumulation in the brain. PMA achieved a pattern similar
to that of KG-SCCA, but included a few additional SNPs from
multiple LD blocks. The bottom panel of Figure 4 shows the
canonical loadings for the imaging data. Both methods identified
similar imaging patterns, which are in accordance with prior
findings (Ramanan et al., 2014). Figure 5 shows a brain map
of canonical loadings generated by KG-SCCA.



5 CONCLUSIONS 

We have performed a brain imaging genetics study to explore the
relationship between brain-wide amyloid accumulation and genetic
variations in the APOE gene. Because most existing SCCA
algorithms are designed using the soft thresholding technique,
which assumes independence among data features, direct application
of these methods to brain imaging genetics studies cannot
yield optimal results owing to the correlated imaging and genetic
features. We have proposed a novel KG-SCCA algorithm, which
not only removes the above independence assumption, but also
can model both the group-like and network-like prior knowledge
in the data to produce improved learning results. A comparative
study has been performed between KG-SCCA and PMA (a
widely used SCCA implementation) on both synthetic and real
data. The empirical results demonstrated that KG-SCCA
significantly outperformed PMA in both cases.
Furthermore, KG-SCCA could accurately recover the true signals
from the synthetic data, as well as yield improved canonical
correlation performances and biologically meaningful findings
from real data. This study is an initial attempt to remove the
feature independence assumption that many existing SCCA methods
make. The empirical studies designed here are targeted to identify
relatively clean and simple multi-SNP-multi-QT correlations.
Given that only 58 SNPs were analyzed here, this work is not a
demonstration of a genome-wide analysis. Comparison with other
complex SCCA models, building scalable KG-SCCA models, and
applications to more complex imaging genetic tasks warrant
further investigation.

Fig. 4. Five-fold trained weights of u (top panel) and v (bottom panel). KG-SCCA results and PMA results are shown for each panel. For each of the
KG-SCCA and PMA imaging results (i.e. the bottom panel), the top and bottom rows correspond to left and right hemispheres, respectively

Fig. 5. Mapping canonical loading generated by KG-SCCA onto the
brain